Datacube processing II

Advanced Geospatial Data Processing for Social Scientists

Dennis Abel & Stefan Jünger

2025-04-29

Now

Day Time Title
April 28 10:00-11:15 Introduction
April 28 11:15-11:30 Coffee Break
April 28 11:30-13:00 Raster data in R
April 28 13:00-14:00 Lunch Break
April 28 14:00-15:15 Raster data processing
April 28 15:15-15:30 Coffee Break
April 28 15:30-17:00 Graphical display of raster data in maps
April 29 10:00-11:15 Datacube processing I
April 29 11:15-11:30 Coffee Break
April 29 11:30-13:00 Datacube processing II & API access
April 29 13:00-14:00 Lunch Break
April 29 14:00-15:15 Data integration and linking (with survey data)
April 29 15:15-15:30 Coffee Break
April 29 15:30-17:00 Outlook and open session with own application

Datacubes - continued

Building on the introduction to data stacks and cubes in R, we now want to dig deeper into wrangling with these objects. We will cover more advanced raster operations and have a case study on accessing remote sensing data from an API. But first, we will briefly revise visualizing with ggplot - this time with our raster stacks/cubes.

Visualizing datacubes

Yesterday, you learned about visualizing SpatRaster objects with ggplot. Extending this approach to raster stacks and cubes is quite straightforward. We can create multiple graphs within one plot (“facets”) with ggplot which allows us to visualize two or more layers in one plot.

Many of you will know facet_wrap() or facet_grid() to separate standard graphs like scatterplots by a third grouping variable. The same logic applies to our third layer dimension (which in our case represents time but could potentially be any other band like RGB).

Visualizing terra raster stacks

We will start with our terra SpatRaster and load the raster stack created in the previous session.

CA_pop_stack <- terra::rast("./data/CA_pop_stack.tif")

CA_pop_stack
class       : SpatRaster 
dimensions  : 1287, 1168, 4  (nrow, ncol, nlyr)
resolution  : 828.2603, 828.2603  (x, y)
extent      : -415531, 551877.1, -607609.6, 458361.4  (xmin, xmax, ymin, ymax)
coord. ref. : NAD83 / California Albers (EPSG:3310) 
source      : CA_pop_stack.tif 
names       : pop_2017, pop_2018, pop_2019, pop_2020 
min values  :      0.0,     0.00,     0.00,     0.00 
max values  :  22052.6, 22123.96, 22208.66, 22285.26 
unit        :    count,    count,    count,    count 
time (years): 2017 to 2020 

Visualizing terra raster stacks

You know the drill - ggplot() initializes our blank canvas which we fill with our geoms. tidyterra's geom_spatraster() is used for plotting SpatRaster objects. With your knowledge on indexing single layers within a raster stack, we can create a single plot like yesterday.

library(tidyterra)
library(scales)

ggplot()+
  geom_spatraster(data = CA_pop_stack[[1]])+
  scale_fill_viridis_c(na.value = "transparent",
                       name = "Pop count",
                       limits = c(0, 2000),
                       oob = scales::squish)+
  theme_minimal()

Visualizing terra raster stacks

Single layers can also be accessed with the layer name supplied to the aes(fill=) argument in geom_spatraster().

library(tidyterra)
library(scales)

ggplot()+
  geom_spatraster(data = CA_pop_stack, aes(fill = pop_2017))+
  scale_fill_viridis_c(na.value = "transparent",
                       name = "Pop count",
                       limits = c(0, 2000),
                       oob = scales::squish)+
  theme_minimal()

Visualizing terra raster stacks

For visualizing all four layers simultaneously, we can utilize the facet_wrap() command. Instead of providing the name of a third variable to the command, we call lyr, which geom_spatraster() recognizes as the third dimension.

ggplot()+
  geom_spatraster(data = CA_pop_stack)+
  scale_fill_viridis_c(na.value = "transparent",
                       name = "Pop count",
                       limits = c(0, 2000),
                       oob = scales::squish)+
  facet_wrap(~ lyr)+
  theme_minimal()

Visualizing terra raster stacks

Creating custom labels for your facets can generally be quite painful (not just for geodata). By default, it uses the values from the names of the SpatRaster attribute. If you want to change labels, you either change the names values (which we often don’t want to) or create a workaround. For example, here we would like to use the time values.

facet_labels <- function(label) {
  years <- as.character(terra::time(CA_pop_stack))
  names(years) <- names(CA_pop_stack)
  new_label <- unname(years[label])
  new_label
}

ggplot()+
  geom_spatraster(data = CA_pop_stack)+
  scale_fill_viridis_c(na.value = "transparent",
                       name = "Pop count",
                       limits = c(0, 2000),
                       oob = scales::squish)+
  facet_wrap(~ lyr,
             labeller = labeller(
               lyr = facet_labels
             )
  )+
  theme_minimal()

Visualizing terra raster stacks

Polishing your graph is business-as-usual.

library(ggspatial)

facet_labels <- function(label) {
  years <- as.character(terra::time(CA_pop_stack))
  names(years) <- names(CA_pop_stack)
  new_label <- unname(years[label])
  new_label
}

ggplot()+
  geom_spatraster(data = CA_pop_stack)+
  scale_fill_viridis_c(na.value = "transparent",
                       name = "Pop count",
                       limits = c(0, 2000),
                       oob = scales::squish)+
  facet_wrap(~ lyr,
             labeller = labeller(
               lyr = facet_labels
             )
  )+
  theme_minimal()+
  labs(
    title = "Population in California",
    subtitle = "In absolute numbers on 1x1km grid",
    caption = "Source: WorldPop (2018)"
  ) +
  annotation_scale(
    location = "bl",
    width_hint = 0.3
  ) +
  annotation_north_arrow(
    location = "tr",
    which_north = "true",
    style = north_arrow_fancy_orienteering
  )

Visualizing stars raster cubes

Now we turn towards our stars datacubes and explore the visualization options. First, we load the raster cube created in the previous session.

CA_pop_cube <- readRDS("./data/CA_pop_cube.rds")

CA_pop_cube
stars object with 3 dimensions and 1 attribute
attribute(s), summary of first 1e+05 cells:
                   Min. 1st Qu. Median     Mean   3rd Qu.     Max.  NA's
population [count]    0       0      0 1.894517 0.1030732 1906.748 71328
dimension(s):
     from   to  offset  delta                    refsys x/y
x       1 1171 -415531  826.2 NAD83 / California Albers [x]
y       1 1279  458361 -826.2 NAD83 / California Albers [y]
time    1    4    2017      1                        NA    

Visualizing stars raster cubes

Base R plotting of stars objects is possible. However, the default color-scheme is limited and often results in plots like this one which are relatively uninformative. It would require manual workarounds on breaks to adjust this. We prefer to directly move on to ggplot.

plot(CA_pop_cube[,,,1])

Visualizing stars raster cubes

Setting up your visualization with ggplot is very similar to the approach before for SpatRaster data. geom_stars allows us to directly input a stars object.

ggplot() +
  geom_stars(data = CA_pop_cube[,,,1], 
             aes(fill = population)
             ) +
  scale_fill_viridis_c(na.value = "transparent",
                       name = "Pop count",
                       limits = c(0, 2000),
                       oob = scales::squish)+
  coord_equal() +
  theme_minimal()

Visualizing stars raster cubes

With facet_wrap, we can create a combination of several layers into one plot. Given that stars objects can have >1 bands, we specify it with the exact name. Here it is “time”. Facet labels are directly derived from that band. We do not need to wrangle with labeling further.

ggplot() +
  geom_stars(data = CA_pop_cube, aes(fill = population)) +
  scale_fill_viridis_c(na.value = "transparent",
                       name = "Pop count",
                       limits = c(0, 2000),
                       oob = scales::squish)+
  coord_equal() +
  facet_wrap(~ time)+
  theme_minimal()+
  labs(
    title = "Population in California",
    subtitle = "In absolute numbers on 1x1km grid",
    caption = "Source: WorldPop (2018)"
  ) +
  annotation_scale(
    location = "bl",
    width_hint = 0.3
  ) +
  annotation_north_arrow(
    location = "tr",
    which_north = "true",
    style = north_arrow_fancy_orienteering
  )

Exercise 6_1: A map with many facets

AI-assisted picture

Case study - Nighttime lights

Now that we have established an understanding of raster stacks and cubes in R, we would like to introduce a case study in nighttime lights. We will briefly introduce:

  • Remote sensing and how nighttime lights data is generated,
  • how we can access that data via a public API and,
  • how to wrangle with that data.

In particular, we will focus on working with and “harmonizing” raster layers with different spatial properties.

R package blackmarbler

blackmarbleR by Robert Marty and Gabriel Stefanini Vicente (2025) supports easy access to NASA’s Black Marble API. Let’s check out their vignette to set up an account and the data retrieval.

library(blackmarbler)

Data retrieval

The function bm_raster() to retrieve the nighttime lights requires as input an sfobject to determine the spatial extent of the downloaded data. The object must be in WGS84.

We will focus on California. Let’s load US states shapefiles with the tigris package and subset to California.

CA_sf <- tigris::states(progress_bar=FALSE) |> 
  dplyr::filter(STUSPS == "CA") |> 
  sf::st_transform(crs = "EPSG:4326")

print(CA_sf)

plot(sf::st_geometry(CA_sf))
Simple feature collection with 1 feature and 15 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -124.482 ymin: 32.52951 xmax: -114.1312 ymax: 42.0095
Geodetic CRS:  WGS 84
  REGION DIVISION STATEFP  STATENS GEOID     GEOIDFQ STUSPS       NAME LSAD
1      4        9      06 01779778    06 0400000US06     CA California   00
  MTFCC FUNCSTAT        ALAND      AWATER    INTPTLAT     INTPTLON
1 G4000        A 403673433805 20291632828 +37.1551773 -119.5434183
                        geometry
1 MULTIPOLYGON (((-119.9999 4...

Data retrieval

Once you have set up your profile at NASA’s Earth Data Portal and generated your API token, you can assign it to an object in R for the data retrieval.

NOTE: I stored mine as an environment variable in the system so that I don’t hard code it into these slides. If you work locally on your own laptop, this is not necessary.

bearer <- Sys.getenv("NASA-token")

# If you work locally, directly assign it
# bearer <- "YOUR_TOKEN"

Data retrieval

We are ready to download the data from NASA’s API.

CA_nl_stack <- bm_raster(roi_sf = CA_sf,
                         product_id = "VNP46A4", # for yearly data
                         date = 2017:2020, # same four years like our population data
                         bearer = bearer, # your API token
                         output_location_type = "file", # we want to store geotiff on disk
                         file_dir = "./data/", # where to store geotiff
                         file_return_null = FALSE # also create SpatRaster file
)

Data retrieval

By default, the function writes the data to the R environment (output_location_type = "memory"). If you want to store it as single GeoTIFFs, specify output_location_type = "file and the file path with file_dir=. file_return_null= further specifies whether the data is additionally loaded to the R environment.

In the previous step, we wrote the data to the disk. Let’s load a single layer for 2020 to inspect the data a bit further.

CA_nl_2020 <- terra::rast("./data/VNP46A4_NearNadir_Composite_Snow_Free_qflag_t2020.tif")

print(CA_nl_2020)
class       : SpatRaster 
dimensions  : 2275, 2485, 1  (nrow, ncol, nlyr)
resolution  : 0.004166667, 0.004166667  (x, y)
extent      : -124.4833, -114.1292, 32.52917, 42.00833  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source      : VNP46A4_NearNadir_Composite_Snow_Free_qflag_t2020.tif 
name        : t2020 
min value   :   0.0 
max value   : 269.2 

Data retrieval

The data is right-skewed. We perform a log-transformation to account for that.

Let’s inspect the data visually.

CA_nl_2020[] <- log(CA_nl_2020[] + 1)

ggplot() +
  geom_spatraster(data = CA_nl_2020) +
  geom_sf(data = CA_sf,
          fill = "transparent",
          color = "white",
          size = 4) +
  scale_fill_gradient2(low = "black",
                       mid = "yellow",
                       high = "red",
                       midpoint = 3,
                       na.value = "transparent") +
  labs(title = "Nighttime Lights in California 2020") +
  coord_sf() +
  theme_void() +
  theme(plot.title = element_text(face = "bold", hjust = 0.5),
        legend.position = "none")

Layers with different properties

We want to utilize this NL data to explore a bit further how to create and process raster stacks. This time, we will focus on situations with more than one variable and on cases where the spatial properties of the different layers do not match. We will work with night lights and population data in 2020.

We will zoom to Los Angeles County for this exercise.

# Load LA County file
LA_county <- tigris::counties("CA", 
                              progress_bar = FALSE
                              ) |> 
  filter(NAME == "Los Angeles") |> 
  st_transform(crs = "EPSG:4326")

# Subset to "mainland" California and 
# exclude the two islands Santa Catalina and San Clemente
LA_county <- LA_county %>%
  st_cast("POLYGON") %>% 
  mutate(area = st_area(.)) %>%
  slice_max(area, n = 1)
Simple feature collection with 1 feature and 19 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -118.9517 ymin: 33.65955 xmax: -117.6464 ymax: 34.8233
Geodetic CRS:  WGS 84
    STATEFP COUNTYFP COUNTYNS GEOID        GEOIDFQ        NAME
1.2      06      037 00277283 06037 0500000US06037 Los Angeles
              NAMELSAD LSAD CLASSFP MTFCC CSAFP CBSAFP METDIVFP FUNCSTAT
1.2 Los Angeles County   06      H1 G4020   348  31080    31084        A
          ALAND     AWATER    INTPTLAT     INTPTLON              area
1.2 10516008433 1784982940 +34.1963983 -118.2618616 10863687022 [m^2]
                          geometry
1.2 POLYGON ((-118.0288 33.8733...

Layers with different properties

Having our spatial sf file properly set up, we can now prep our two raster files.

# Create LA raster files for population and night lights in 2020
LA_pop_2020 <- terra::rast("./data/US-CA_ppp_2020_1km.tif") |> 
  terra::mask(terra::vect(LA_county)) |> 
  terra::crop(LA_county)

LA_pop_2020[] <- log(LA_pop_2020[] + 1)

LA_nl_2020 <- terra::mask(CA_nl_2020,
                          terra::vect(LA_county)) |>
  terra::crop(LA_county)

Layers with different properties

Visually, there is a strong positive relationship between nighttime lights and population in LA.

plot(LA_pop_2020)

plot(LA_nl_2020)

Layers with different properties

Our two raster files have the same CRS and (almost same) spatial extent. Unfortunately, the resolution (cell size) differs. Our population data is on an approx. 1km grid and our night lights data on an approx. 500m grid.

print(LA_pop_2020)
class       : SpatRaster 
dimensions  : 140, 157, 1  (nrow, ncol, nlyr)
resolution  : 0.008333333, 0.008333333  (x, y)
extent      : -118.9512, -117.6429, 33.65792, 34.82458  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : memory
varname     : US-CA_ppp_2020_1km 
name        : US-CA_ppp_2020_1km 
min value   :           0.000000 
max value   :           9.843707 
print(LA_nl_2020)
class       : SpatRaster 
dimensions  : 280, 313, 1  (nrow, ncol, nlyr)
resolution  : 0.004166667, 0.004166667  (x, y)
extent      : -118.95, -117.6458, 33.65833, 34.825  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : memory
varname     : VNP46A4_NearNadir_Composite_Snow_Free_qflag_t2020 
name        :    t2020 
min value   : 0.000000 
max value   : 4.400603 

Layers with different properties

Most simple solution to harmonize: If cell sizes are multiples (which is the case in our example), we can use terra::disagg() or terra::aggregate() to adjust resolution of one layer to resolution of another.

First option: Increase resolution for population data

# Increase resolution for population data
LA_pop_2020_high <- terra::disagg(LA_pop_2020,
                                 fact = c(2, 2),
                                 method = "bilinear")

# There is still a slight mismatch due to rounding errors (one more ncol)
# Let's crop to the spatial extent of the nightlights data
LA_pop_2020_high <- crop(
  LA_pop_2020_high, 
  terra::ext(LA_nl_2020)
)

Layers with different properties

Most simple solution to harmonize: If cell sizes are multiples (which is the case in our example), we can use terra::disagg() or terra::aggregate() to adjust resolution of one layer to resolution of another.

First option: Increase resolution for population data

# Cross-check
res(LA_pop_2020_high)
[1] 0.004166667 0.004166667
res(LA_nl_2020)
[1] 0.004166667 0.004166667
ext(LA_pop_2020_high)
SpatExtent : -118.9512495092, -117.64708284775, 33.65791673341, 34.82458339541 (xmin, xmax, ymin, ymax)
ext(LA_nl_2020)
SpatExtent : -118.95, -117.645833333333, 33.6583333333333, 34.825 (xmin, xmax, ymin, ymax)
# A small rounding error in extent will prohibit to concatenate 
# into a stack. We now force the extent
ext(LA_pop_2020_high) <- ext(LA_nl_2020)

Layers with different properties

Most simple solution to harmonize: If cell sizes are multiples (which is the case in our example), we can use terra::disagg() or terra::aggregate() to adjust resolution of one layer to resolution of another.

Second option: Decrease resolution for nightlights data

# Decrease resolution for nightlights data
LA_nl_2020_low <- terra::aggregate(LA_nl_2020,
                                     fact = c(2, 2),
                                     method = "bilinear")

# There is still a slight mismatch due to rounding errors (one more ncol)
# Let's crop to the spatial extent of the population data
LA_nl_2020_low <- crop(
  LA_nl_2020_low, 
  terra::ext(LA_pop_2020)
)

Layers with different properties

Most simple solution to harmonize: If cell sizes are multiples (which is the case in our example), we can use terra::disagg() or terra::aggregate() to adjust resolution of one layer to resolution of another.

Second option: Decrease resolution for nightlights data

# Cross-check
res(LA_pop_2020)
[1] 0.008333333 0.008333333
res(LA_nl_2020_low)
[1] 0.008333333 0.008333333
ext(LA_pop_2020)
SpatExtent : -118.9512495092, -117.6429161811, 33.65791673341, 34.82458339541 (xmin, xmax, ymin, ymax)
ext(LA_nl_2020_low)
SpatExtent : -118.95, -117.641666666667, 33.6583333333333, 34.825 (xmin, xmax, ymin, ymax)
ext(LA_nl_2020_low) <- ext(LA_pop_2020)

Layers with different properties

If cell sizes are not a multiple, we can utilize terra::resample() to adjust resolution of one layer based on resolution of another.

# Cross-check
LA_pop_2020_resampled <- resample(
  x      = LA_pop_2020,
  y      = LA_nl_2020,
  method = "bilinear"
)

res(LA_pop_2020_resampled)
[1] 0.004166667 0.004166667
res(LA_nl_2020)
[1] 0.004166667 0.004166667
ext(LA_pop_2020_resampled)
SpatExtent : -118.95, -117.645833333333, 33.6583333333333, 34.825 (xmin, xmax, ymin, ymax)
ext(LA_nl_2020)
SpatExtent : -118.95, -117.645833333333, 33.6583333333333, 34.825 (xmin, xmax, ymin, ymax)

Layers with different properties

In our case, resampling directly created a perfect fit in resolution. We can combine both layers into a stack.

LA_stack <- c(LA_pop_2020_resampled, LA_nl_2020)

print(LA_stack)
class       : SpatRaster 
dimensions  : 280, 313, 2  (nrow, ncol, nlyr)
resolution  : 0.004166667, 0.004166667  (x, y)
extent      : -118.95, -117.6458, 33.65833, 34.825  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : memory
varnames    : VNP46A4_NearNadir_Composite_Snow_Free_qflag_t2020 
              VNP46A4_NearNadir_Composite_Snow_Free_qflag_t2020 
names       : US-CA_ppp_2020_1km,    t2020 
min values  :           0.000000, 0.000000 
max values  :           9.733816, 4.400603 

Layers with different properties

varnames(LA_stack) <- c("population", "nightlights")
names(LA_stack) <- c("population", "nightlights")

terra::plot(LA_stack)

Layers with different properties

More complex: Imputation logics. Previous examples follow the idea of interpolating existing data across the spatial domain. Imputation fills in missing values based on a prediction model. Let’s consider our two variables to make up a stylized example:

We know that population density and nightlights is correlated. We could try to predict the missing values for population based on the values of nightlights to generate the higher resolution population data. In order to do that, we train a RandomForest model on the low resolution data of population and nightlights.

library(randomForest)

Layers with different properties

There is some data prepping to do…

# Covariates need to be in same size as outcome variable = 1km
LA_nl_2020_resampled <- resample(
  x      = LA_nl_2020,
  y      = LA_pop_2020,
  method = "bilinear"
)

# Create training data - one row per cell
train_data <- as.data.frame(LA_nl_2020_resampled, 
                            xy = TRUE, 
                            cells = TRUE,
                            na.rm = FALSE) |> 
  left_join(as.data.frame(LA_pop_2020, 
                          xy=FALSE, 
                          cells=TRUE, 
                          na.rm=FALSE),
            by = "cell") |> 
  rename(nightlights = t2020,
         population = `US-CA_ppp_2020_1km`)

train_data <- na.omit(train_data)

Layers with different properties

Ready to fit the model and predict population data on 500m grid.

# Fit model
out <- randomForest(
  population ~ nightlights,
  data = train_data,
  ntree = 500
)

# Predict on the 500m grid
# Covariate names need to match
names(LA_stack)
[1] "population"  "nightlights"
pop_500m <- predict(LA_stack, out)

names(pop_500m) <- "population_predicted"

LA_stack <- c(LA_stack, pop_500m)

Layers with different properties

Of course this is a simplified approach. What other factors might correlate with population data (and nightlights) and might improve the validity of the imputed cells?

terra::plot(LA_stack)

Stack-based descriptive statistics

# Global univariate means
global(LA_stack, fun = mean, na.rm=TRUE)
                          mean
population           2.9236307
nightlights          0.5277355
population_predicted 2.7963083
# Bivariate correlations
layerCor(LA_stack, fun = "cor", use = "complete.obs")
$correlation
                     population nightlights population_predicted
population            1.0000000   0.8245851            0.8757795
nightlights           0.8245851   1.0000000            0.8728175
population_predicted  0.8757795   0.8728175            1.0000000

$mean
                     population nightlights population_predicted
population                  NaN    2.923631            2.9236307
nightlights           0.5277355         NaN            0.5277355
population_predicted  2.7963083    2.796308                  NaN

$n
                     population nightlights population_predicted
population                  NaN       59392                59392
nightlights               59256         NaN                59256
population_predicted      59256       59256                  NaN

Exercise 6_2: Combining different layers

AI-assisted picture